[ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script#36949

Open
AndreasKaratzas wants to merge 26 commits into vllm-project:main from ROCm:akaratza_optimize_docker_build
Conversation


@AndreasKaratzas AndreasKaratzas commented Mar 13, 2026

Summary

Implements the three-tier Docker build described in the RFC (#34994) for ROCm CI. Every PR currently rebuilds RIXL, DeepEP, rocshmem, torchcodec, and RDMA libraries from scratch — costing ~11–19 minutes of pure overhead per build. This PR introduces a pre-built Tier-1 ci_base image that absorbs those stable layers. Per-PR builds then only rebuild the thin vLLM wheel + workspace layer (~2–3 minutes).

Image registry layout after this PR:

| Tag | Built by | Frequency |
| --- | --- | --- |
| rocm/vllm-dev:base | Dockerfile.rocm_base | Monthly |
| rocm/vllm-dev:ci_base | ci_base stage (this PR) | Weekly |
| rocm/vllm-ci:$COMMIT | test stage (this PR) | Per PR |

docker/Dockerfile.rocm — global ARG CI_BASE_IMAGE
+ARG CI_BASE_IMAGE=rocm/vllm-dev:ci_base

Docker requires any ARG used in a FROM instruction to be declared before the first FROM in the file. CI_BASE_IMAGE controls which image the test stage inherits from. The default rocm/vllm-dev:ci_base points to the stable weekly-built Tier-1 image on Docker Hub. When building --target ci_base (the weekly scheduled build), this ARG is irrelevant because the ci_base stage inherits from base, not CI_BASE_IMAGE.
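As a minimal sketch of the mechanics (the base image reference is illustrative, not the actual one in Dockerfile.rocm):

```dockerfile
# Must be declared before the first FROM to be usable in a FROM instruction.
ARG CI_BASE_IMAGE=rocm/vllm-dev:ci_base

FROM some-rocm-base AS base        # illustrative stand-in
# ... stable toolchain layers ...

FROM base AS ci_base               # weekly Tier-1 build; ignores CI_BASE_IMAGE
# ... RIXL / DeepEP / rocshmem / torchcodec layers ...

FROM ${CI_BASE_IMAGE} AS test      # per-PR build; defaults to the registry tag
# ... thin vLLM wheel + workspace layers ...
```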

docker/Dockerfile.rocm — new ci_base stage
+FROM base AS ci_base

A new intermediate Docker build stage inserted between the existing export_vllm_wheel_release and test stages. Everything in this stage is stable — it changes only when pinned dependency branches (e.g. RIXL_BRANCH, DEEPEP_BRANCH) or the upstream base image change. Building this stage takes ~10–18 minutes; by caching it as rocm/vllm-dev:ci_base, every per-PR test build avoids repeating that work.

What goes into ci_base and why:

| Layer | Why here and not in test |
| --- | --- |
| RIXL wheel (build_rixl) | Pinned branch build — changes only with RIXL_BRANCH |
| DeepEP wheel (build_deepep) | Pinned branch build — changes only with DEEPEP_BRANCH |
| rocshmem libs (build_rocshmem) | Pinned branch build — changes only with ROCSHMEM_BRANCH |
| RDMA apt libs (librdmacm1, libibverbs1, …) | Stable system deps — change only with distro upgrades |
| FFmpeg dev libs | Required for torchcodec source build; stable |
| torchcodec (source build) | PyTorch ROCm version mismatch prevents PyPI install; slow build, rarely changes |
| hf_transfer + pytest-shard | Pure-Python, version-stable test tooling |
| HF_HUB_ENABLE_HF_TRANSFER=1 | Runtime env for the above |
| MIOPEN_DEBUG_CONV_DIRECT/GEMM=0 | Static workaround for pytorch#169857 |

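Building and pushing the Tier-1 image then reduces to something like the following (a plain docker buildx invocation shown for illustration only; CI actually drives this through ci-bake.sh and the bake targets in ci-infra):

```shell
docker buildx build \
  --target ci_base \
  -f docker/Dockerfile.rocm \
  -t rocm/vllm-dev:ci_base \
  --push .
```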
docker/Dockerfile.rocm — test stage now inherits from ${CI_BASE_IMAGE}
-FROM base AS test
+FROM ${CI_BASE_IMAGE} AS test

The test stage now starts from rocm/vllm-dev:ci_base (pulled from registry) instead of base. Docker pulls the pre-built Tier-1 image and adds only the PR-specific layers on top:

  • vLLM requirements (rocm.txt, rocm-test.txt) + the PR wheel
  • vLLM wheel copied to /opt/vllm-wheels/ for python_only_compile_rocm.sh
  • Workspace copy (/vllm-workspace) + vllm_test_utils install
  • vllm/v1 package + src/vllm layout

All of these vary every PR, so they can't be pre-baked. The net result is that the per-PR build touches only ~2–3 minutes of work instead of 15+.

When CI_BASE_IMAGE is set to the local build-stage name ci_base (e.g. in a docker buildx bake ci-base-rocm-ci run), Docker resolves it as the locally-built stage rather than pulling from the registry, so the weekly Tier-1 build itself is not broken by this change.
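The override mechanism in bake terms looks roughly like this (the variable and target bodies here are assumptions for illustration, not the actual ci-rocm.hcl contents):

```hcl
variable "CI_BASE_IMAGE" {
  default = "rocm/vllm-dev:ci_base"
}

target "test-rocm-ci" {
  dockerfile = "docker/Dockerfile.rocm"
  target     = "test"
  args = {
    # "ci_base" resolves to the locally built stage of the same name;
    # any registry tag (including a dated rollback tag) is pulled instead.
    CI_BASE_IMAGE = CI_BASE_IMAGE
  }
}
```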

.buildkite/hardware_tests/amd.yaml — docker pull + new env vars
+    - docker pull rocm/vllm-dev:ci_base
     - bash .buildkite/scripts/ci-bake.sh test-rocm-ci
  env:
+    CI_BASE_IMAGE: "rocm/vllm-dev:ci_base"
+    REMOTE_VLLM: "1"
+    VLLM_BRANCH: "${BUILDKITE_COMMIT}"

Applied to all four build steps (all-archs, gfx90a, gfx942, gfx950).

docker pull rocm/vllm-dev:ci_base
Pre-fetches the Tier-1 image before docker buildx bake runs. Without this, BuildKit would pull it lazily during the build and the pull would not appear in plain-progress output, making timing harder to diagnose. Also ensures the pull fails fast and visibly if the weekly image is missing.

CI_BASE_IMAGE: "rocm/vllm-dev:ci_base"
Passed to docker buildx bake as a Docker --build-arg (via the _ci-rocm target in ci-rocm.hcl in ci-infra). Tells Dockerfile.rocm which image to use as the base of the test stage. Without this, docker buildx bake would use the HCL variable default, which may not match the registry tag.

REMOTE_VLLM: "1"
Dockerfile.rocm has two code paths for getting the vLLM source:

  • REMOTE_VLLM=0 (default, local dev): COPY from the local build context.
  • REMOTE_VLLM=1 (CI path): git clone $VLLM_REPO --branch $VLLM_BRANCH inside the build.

In CI the build agent runs docker buildx bake from the vllm repo checkout, but the vLLM source to be tested is the PR commit. Setting REMOTE_VLLM=1 tells the Dockerfile to clone from GitHub at build time using VLLM_BRANCH below.

VLLM_BRANCH: "${BUILDKITE_COMMIT}"
The git ref that Docker clones when REMOTE_VLLM=1. Set to the exact commit SHA being tested (Buildkite expands ${BUILDKITE_COMMIT} at pipeline-step evaluation time). Without this, the build would check out main regardless of which PR commit triggered the build — producing a test image that does not match the PR.

.buildkite/hardware_tests/amd-ci-base.yaml — new scheduled pipeline

New pipeline file that runs weekly (or on-demand when dependency branches change) to build and push rocm/vllm-dev:ci_base.

commands:
  - export DATED_TAG="rocm/vllm-dev:ci_base-$(date +%Y%m%d)"
  - export IMAGE_TAG="$DATED_TAG"
  - export CI_BASE_IMAGE_TAG_DATED="$DATED_TAG"
  - bash .buildkite/scripts/ci-bake.sh ci-base-rocm-ci
env:
  CI_BASE_IMAGE_TAG: "rocm/vllm-dev:ci_base"
  DOCKERHUB_CACHE_TO: "rocm/vllm-ci-cache:rocm-latest"

DATED_TAG / CI_BASE_IMAGE_TAG_DATED
The dated snapshot tag (e.g. rocm/vllm-dev:ci_base-20250330). Computed at runtime so it always reflects today's date. Used for rollback: if the weekly build introduces a regression, you can pin CI_BASE_IMAGE to a specific dated tag in amd.yaml without rebuilding.

IMAGE_TAG (set to the dated tag)
ci-bake.sh checks whether $IMAGE_TAG already exists in the registry before building. Setting it to the dated tag means the weekly build always runs on a new day (the dated tag does not exist yet), and re-running on the same day is idempotent (skipped).
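The skip-if-exists logic can be sketched as follows; `image_exists` is a hypothetical stand-in for the registry probe ci-bake.sh performs (something like `docker manifest inspect`), stubbed out here so the control flow is visible:

```shell
# Dated snapshot tag, computed at runtime so it reflects today's date.
DATED_TAG="rocm/vllm-dev:ci_base-$(date +%Y%m%d)"
IMAGE_TAG="$DATED_TAG"

image_exists() {
  # The real script would query the registry here; stubbed for illustration.
  return 1
}

if image_exists "$IMAGE_TAG"; then
  echo "Tag $IMAGE_TAG already published; skipping build (idempotent re-run)"
else
  echo "Building $IMAGE_TAG"
fi
```

On the first run of a given day the dated tag is new, so the build proceeds; a same-day re-run finds the tag and becomes a no-op.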

CI_BASE_IMAGE_TAG
The stable rocm/vllm-dev:ci_base tag that per-PR builds pull. Always pushed by this pipeline, so it always points to the most recent weekly build.

DOCKERHUB_CACHE_TO: "rocm/vllm-ci-cache:rocm-latest"
Tells BuildKit to write the layer cache to Docker Hub after the weekly build. This seeds the :rocm-latest cache tag so that per-PR builds (which read from this cache) get warm layers from a recent main-branch state.
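In bake terms the cache seeding is wired roughly like this (attribute values assumed for illustration):

```hcl
target "_ci-rocm" {
  # Per-PR builds only read; the weekly ci-base build also writes,
  # so PR builds get warm layers without racing to overwrite the cache.
  cache-from = ["type=registry,ref=rocm/vllm-ci-cache:rocm-latest"]
  cache-to   = ["type=registry,ref=rocm/vllm-ci-cache:rocm-latest,mode=max"]
}
```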


This PR is connected to: vllm-project/ci-infra#307
These two PRs should likely be merged simultaneously.

cc @kenroche @okakarpa @tjtanaa @gshtras @khluu

Co-authored-by: Claude claude@anthropic.com

…line

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant optimizations to the ROCm Docker build process by leveraging docker bake, multi-stage builds, and caching mechanisms like ccache. The new ci-bake.sh script centralizes and improves the CI build logic, enhancing build times and reliability. The changes are well-structured and thoughtful. I've identified a couple of critical issues related to missing runtime dependencies in the Dockerfile and a high-severity issue regarding configuration consistency in the new bake script.

Comment on lines 406 to 407
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
Contributor


critical

The rocshmem library appears to be a runtime dependency for deepep. This test stage installs the deepep wheel but no longer copies the rocshmem installation from the build stage. This could lead to runtime errors if the deepep wheel does not bundle the rocshmem shared libraries. Please restore the copy of the rocshmem directory from the build_rocshmem stage to ensure deepep can function correctly.

RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl

# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem

Collaborator Author


Done :)

Comment on lines +491 to +492
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
Contributor


critical

Similar to the test stage, the final stage now installs the deepep wheel but is missing the rocshmem runtime libraries which are likely a runtime dependency. This is likely to cause runtime failures. Please add a COPY instruction to include the rocshmem installation from the build_rocshmem stage.

RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl

# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem

Collaborator Author


Done :)

Comment on lines +85 to +96
# Check if baked-vllm-builder already exists and is using the socket
if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
  echo "Using existing baked-vllm-builder"
  docker buildx use baked-vllm-builder
else
  echo "Creating baked-vllm-builder with remote driver"
  docker buildx create \
    --name baked-vllm-builder \
    --driver remote \
    --use \
    "unix://${BUILDKIT_SOCKET}"
fi
Contributor


high

There's an inconsistency in the buildx builder naming. The script accepts a BUILDER_NAME environment variable (defaulting to vllm-builder), but when a local buildkitd socket is detected, it hardcodes the builder name to baked-vllm-builder. This could lead to confusion and incorrect builder usage if BUILDER_NAME is customized. For consistency, please use the ${BUILDER_NAME} variable throughout the script.

Suggested change
# Check if baked-vllm-builder already exists and is using the socket
if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
  echo "Using existing baked-vllm-builder"
  docker buildx use baked-vllm-builder
else
  echo "Creating baked-vllm-builder with remote driver"
  docker buildx create \
    --name baked-vllm-builder \
    --driver remote \
    --use \
    "unix://${BUILDKIT_SOCKET}"
fi

# Check if ${BUILDER_NAME} already exists and is using the socket
if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
  echo "Using existing builder: ${BUILDER_NAME}"
  docker buildx use "${BUILDER_NAME}"
else
  echo "Creating builder '${BUILDER_NAME}' with remote driver"
  docker buildx create \
    --name "${BUILDER_NAME}" \
    --driver remote \
    --use \
    "unix://${BUILDKIT_SOCKET}"
fi

Collaborator Author


Done :)

apt-transport-https ca-certificates wget curl
apt-transport-https ca-certificates wget curl \
ccache mold \
&& update-alternatives --install /usr/bin/ld ld /usr/bin/mold 100
Collaborator


Changing the system loader hardly falls under "Install some basic utilities"
Could you at least provide the motivation for this in the PR description?

Collaborator Author


Correct, mb, I updated the comment there as well.

RUN --mount=type=cache,target=/root/.cache/ccache \
    --mount=type=cache,target=/root/.cache/uv \
    cd vllm \
    && uv pip install --system -r requirements/rocm-build.txt \
Collaborator


Why is rocm-build.txt being used in the docker build?

Collaborator Author


That's an oversight on my part, thought it was just rocm.txt. I updated that as well.

COPY requirements/rocm-build.txt requirements/rocm-build.txt
COPY pyproject.toml setup.py CMakeLists.txt ./
COPY cmake cmake/
COPY csrc csrc/
Collaborator


Are you copying host files here? The point of REMOTE_VLLM is exactly to not do this

Collaborator Author


I refactored based on offline conversation to bring back the old way of doing this and avoid any trouble. I also integrated another recommended point which is a per-arch build so that we then use a specific docker dependency and not an all-arch dependency. Hope it looks better now :)

@AndreasKaratzas AndreasKaratzas marked this pull request as draft March 13, 2026 18:27
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review March 18, 2026 05:38

mergify bot commented Mar 19, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 19, 2026
@AndreasKaratzas
Collaborator Author

@mawong-amd Let's check if Kernels Core Operation Test passes as well. We may need to bring back the compilation of triton_kernels. Not sure yet.

@mergify mergify bot removed the needs-rebase label Mar 26, 2026

Labels

ci/build rocm Related to AMD ROCm

Projects

Status: Todo


2 participants